Off-Policy Actor-Critic

Authors

  • Thomas Degris
  • Martha White
  • Richard S. Sutton
Abstract

This paper presents the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable a target policy to be learned while following and obtaining data from another (behavior) policy. For many problems, however, actor-critic methods are more practical than action-value methods (like Greedy-GQ) because they explicitly represent the policy; consequently, the policy can be stochastic and utilize a large action space. In this paper, we illustrate how to practically combine the generality and learning potential of off-policy learning with the flexibility in action selection given by actor-critic methods. We derive an incremental algorithm with linear time and space complexity that includes eligibility traces, prove convergence under assumptions similar to previous off-policy algorithms, and empirically show better or comparable performance to existing algorithms on standard reinforcement-learning benchmark problems.

The reinforcement learning framework is a general temporal learning formalism that has, over the last few decades, seen a marked growth in algorithms and applications. Until recently, however, practical online methods with convergence guarantees have been restricted to the on-policy setting, in which the agent learns only about the policy it is executing. In an off-policy setting, by contrast, an agent learns about a policy or policies different from the one it is executing. Off-policy methods have a wider range of applications and learning possibilities. Unlike on-policy methods, off-policy methods are able to, for example, learn about an optimal policy while executing an exploratory policy (Sutton & Barto, 1998), learn from demonstration (Smart & Kaelbling, 2002), and learn multiple tasks in parallel from a single sensorimotor interaction with an environment (Sutton et al., 2011). Because of this generality, off-policy methods are of great interest in many application domains.

The best-known off-policy method is Q-learning (Watkins & Dayan, 1992). However, while Q-learning is guaranteed to converge to the optimal policy in the tabular (non-approximate) case, it may diverge when using linear function approximation (Baird, 1995). Least-squares methods such as LSTD (Bradtke & Barto, 1996) and LSPI (Lagoudakis & Parr, 2003) can be used off-policy and are sound with linear function approximation, but they are computationally expensive: their complexity scales quadratically with the number of features and weights. Recently, these problems have been addressed by the new family of gradient-TD (temporal-difference) methods (e.g., Sutton et al., 2009), such as Greedy-GQ (Maei et al., 2010), which are of linear complexity and convergent under off-policy training with function approximation. All action-value methods, including gradient-TD methods such as Greedy-GQ, suffer from three important limitations.
First, their target policies are deterministic, whereas many problems have stochastic optimal policies, such as in adversarial settings or in partially observable Markov decision processes. Second, finding the greedy action with respect to the action-value function becomes problematic for larger action spaces. Finally, a small change in the action-value function can cause large changes in the policy, which creates difficulties for convergence proofs and for some real-time applications.

The standard way of avoiding these limitations of action-value methods is to use policy-gradient algorithms (Sutton et al., 2000) such as actor-critic methods (e.g., Bhatnagar et al., 2009). For example, the natural actor-critic, an on-policy policy-gradient algorithm, has been successful for learning in continuous action spaces in several robotics applications (Peters & Schaal, 2008).

The first and main contribution of this paper is to introduce the first actor-critic method that can be applied off-policy, which we call Off-PAC, for Off-Policy Actor-Critic. Off-PAC has two learners: the actor and the critic. The actor updates the policy weights. The critic learns an off-policy estimate of the value function for the current actor policy, different from the (fixed) behavior policy; this estimate is then used by the actor to update the policy. For the critic, in this paper we consider a version of Off-PAC that uses GTD(λ) (Maei, 2011), a gradient-TD method with eligibility traces for learning state-value functions. We define a new objective for our policy weights and derive a valid backward-view update using eligibility traces. The time and space complexity of Off-PAC is linear in the number of learned weights.

The second contribution of this paper is an off-policy policy-gradient theorem and a convergence proof for Off-PAC when λ = 0, under assumptions similar to those of previous off-policy gradient-TD proofs. Our third contribution is an empirical comparison of Q(λ), Greedy-GQ, Off-PAC, and a soft-max version of Greedy-GQ that we call Softmax-GQ, on three benchmark problems in an off-policy setting. To the best of our knowledge, this paper is the first to provide an empirical evaluation of gradient-TD methods for off-policy control (the closest prior work we know of is that of Delp (2011)). We show that Off-PAC outperforms the other algorithms on these problems.

1. Notation and Problem Setting

In this paper, we consider Markov decision processes with a discrete state space S, a discrete action space A, a transition distribution P : S × S × A → [0, 1], where P(s′|s, a) is the probability of transitioning into state s′ from state s after taking action a, and an expected reward function R : S × A × S → R that gives the expected reward for taking action a in state s and transitioning into s′. We observe a stream of data consisting of states s_t ∈ S, actions a_t ∈ A, and rewards r_t ∈ R for t = 1, 2, . . ., with actions selected from a fixed behavior policy b(a|s) ∈ (0, 1]. Given a termination condition γ : S → [0, 1] (Sutton et al., 2011), we define the value function for a policy π : S × A → (0, 1] to be

    V^{π,γ}(s) = E[ r_{t+1} + · · · + r_{t+T} | s_t = s ]   for all s ∈ S,   (1)

where π is followed from time step t and termination occurs at time t + T according to γ. We assume termination always occurs in a finite number of steps. The action-value function is defined as

    Q^{π,γ}(s, a) = Σ_{s′∈S} P(s′|s, a) [ R(s, a, s′) + γ(s′) V^{π,γ}(s′) ]   (2)

for all s ∈ S and a ∈ A. Note that V^{π,γ}(s) = Σ_{a∈A} π(a|s) Q^{π,γ}(s, a) for all s ∈ S.
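For a concrete check of definitions (1) and (2), the short Python sketch below uses a tiny, hypothetical two-state, two-action MDP invented for illustration (nothing here comes from the paper itself). It computes Q^{π,γ} from equation (2) and V^{π,γ} from the identity V^{π,γ}(s) = Σ_a π(a|s) Q^{π,γ}(s, a), iterating to their joint fixed point; in this example the effective per-step continuation is below 1, so the iteration converges.

```python
import numpy as np

# Hypothetical two-state, two-action MDP (illustration only; not from the paper).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.ones((2, 2, 2))                        # R(s, a, s') = 1 everywhere
gamma = np.array([0.9, 0.0])                  # termination condition gamma(s'); state 1 always terminates
pi = np.array([[0.3, 0.7], [0.5, 0.5]])       # target policy pi(a | s)

V = np.zeros(2)
for _ in range(1000):                         # iterate to the fixed point
    Q = np.einsum('sap,sap->sa', P, R + gamma * V)   # eq. (2): sum over s' of P * [R + gamma(s') V(s')]
    V = (pi * Q).sum(axis=1)                         # identity: V(s) = sum_a pi(a|s) Q(s, a)

# At the fixed point, re-applying (2) leaves the values unchanged.
Q_check = np.einsum('sap,sap->sa', P, R + gamma * V)
assert np.allclose(V, (pi * Q_check).sum(axis=1), atol=1e-6)
print("V^{pi,gamma} =", V)
print("Q^{pi,gamma} =", Q)
```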
The policy π_u : A × S → [0, 1] is an arbitrary, differentiable function of a weight vector u ∈ R^{N_u}, N_u ∈ N, with π_u(a|s) > 0 for all s ∈ S, a ∈ A. Our aim is to choose u so as to maximize a scalar objective function of the policy weights.
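Although the excerpt is cut off here, one standard way to realize such a differentiable policy over a discrete action set is a Gibbs/softmax distribution over linear action preferences u·x(s, a). This is only an illustrative sketch; the per-action feature function x(s, a) is a hypothetical placeholder rather than notation from the excerpt. The exponential keeps π_u(a|s) > 0 for every state-action pair, as required above, and the gradient of log π_u(a|s), which policy-gradient updates rely on, has a simple closed form:

```python
import numpy as np

def softmax_policy(u, x):
    """Gibbs/softmax policy built from linear action preferences u . x(s, a).

    u : policy weight vector, shape (N_u,)
    x : per-action feature matrix for one state, shape (n_actions, N_u); x[a] = x(s, a)
    Returns the action probabilities pi_u(.|s) and, for each action a,
    the gradient of log pi_u(a|s) with respect to u.
    """
    prefs = x @ u
    prefs -= prefs.max()                       # shift preferences for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    # grad_u log pi_u(a|s) = x(s, a) - sum_b pi_u(b|s) x(s, b)
    grad_log = x - probs @ x
    return probs, grad_log

# Usage with hypothetical random features (illustration only).
rng = np.random.default_rng(0)
x_s = rng.normal(size=(3, 5))                  # 3 actions, N_u = 5 policy weights
probs, grad_log = softmax_policy(np.zeros(5), x_s)
print(probs)                                   # uniform when u = 0, and strictly positive
print(grad_log[1])                             # gradient of log pi_u(a=1 | s) w.r.t. u
```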

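Finally, to make the abstract's description of the two learners concrete, here is a deliberately simplified, hypothetical sketch of a per-time-step off-policy actor-critic update driven by the importance-sampling ratio ρ_t = π_u(a_t|s_t) / b(a_t|s_t): a linear critic forms a TD error for the target policy from behavior-policy data, and the actor moves its weights along ρ_t δ_t ∇_u log π_u(a_t|s_t). This is not the paper's Off-PAC algorithm (whose critic is GTD(λ) with eligibility traces and comes with a convergence proof); the critic below is a plain importance-weighted one-step TD update, and all names are invented for illustration.

```python
import numpy as np

def offpac_like_step(v, u, x_s, x_s_next, x_sa, a, r, gamma_next, b_prob,
                     alpha_v=0.01, alpha_u=0.001):
    """One simplified off-policy actor-critic update (illustrative sketch only).

    v          : critic weights of a linear value estimate V(s) ~ v . x(s)
    u          : actor (policy) weights
    x_s, x_s_next : state features x(s_t) and x(s_{t+1}), shape (n,)
    x_sa       : per-action features at s_t, shape (n_actions, N_u)
    a          : index of the action actually taken by the behavior policy b
    r          : reward r_{t+1}
    gamma_next : termination condition gamma(s_{t+1})
    b_prob     : behavior-policy probability b(a_t | s_t)
    """
    # Softmax target policy pi_u(.|s_t) and gradient of log pi_u(a|s_t), as in the previous sketch.
    prefs = x_sa @ u
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()
    grad_log_a = x_sa[a] - probs @ x_sa

    rho = probs[a] / b_prob                              # importance-sampling ratio pi_u / b
    delta = r + gamma_next * (v @ x_s_next) - v @ x_s    # TD error for the target policy
    v = v + alpha_v * rho * delta * x_s                  # simplified importance-weighted TD critic
    u = u + alpha_u * rho * delta * grad_log_a           # actor: ascend rho * delta * grad log pi
    return v, u

# Hypothetical usage with random features (illustration only).
rng = np.random.default_rng(0)
n, n_actions, N_u = 4, 3, 6
v, u = np.zeros(n), np.zeros(N_u)
v, u = offpac_like_step(v, u, rng.normal(size=n), rng.normal(size=n),
                        rng.normal(size=(n_actions, N_u)), a=1, r=1.0,
                        gamma_next=0.9, b_prob=1.0 / n_actions)
print(v, u)
```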

Similar Articles

Convergent Actor-Critic Algorithms Under Off-Policy Training and Function Approximation

We present the first class of policy-gradient algorithms that work with both state-value and policy function approximation, and are guaranteed to converge under off-policy training. Our solution targets problems in reinforcement learning where the action representation adds to the curse of dimensionality; that is, with continuous or large action sets, thus making it infeasible to estimate state-...

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods t...

The Reactor: A Sample-Efficient Actor-Critic Architecture

In this work we present a new reinforcement learning agent, called Reactor (for Retrace-Actor), based on an off-policy multi-step return actor-critic architecture. The agent uses a deep recurrent neural network for function approximation. The network outputs a target policy π (the actor), an action-value Q-function (the critic) evaluating the current policy π, and an estimated behavioural policy...

A Batch, Off-Policy, Actor-Critic Algorithm for Optimizing the Average Reward

We develop an off-policy actor–critic algorithm for learning an optimal policy from a training set composed of data from multiple individuals. This algorithm is developed with a view toward its use in mobile health.

Linear Off-Policy Actor-Critic

This paper presents the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning. Off...

Journal:
  • CoRR

Volume: abs/1205.4839

Pages: -

Publication date: 2012